Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec 11 23:53:57 2023"

The text continues here.

I have really enjoyed the start of this course. As a PhD student, I feel I was thrown into the deep end of R when starting my PhD, working quite early on with, for example, single-cell analysis and WES and WGS data, so I never really had time to focus on the basics.

I liked the set-up of the exercise set: it is user-friendly and it was nice to go back to the basics. Some of it was quite repetitive, and some types of plots I won’t ever use in my own research, but I guess they are good to know in general.

I liked the trick of not leaving install.packages() in the code but rather commenting it out and only running it selectively when needed; this was useful.

I didn’t really understand the examples with the dates.

pivot_longer() was also useful; I often struggle with this and end up googling how to do it again.
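Since I keep forgetting the syntax, here is a minimal sketch of pivot_longer() on a made-up data frame (the column names are invented purely for illustration):

```r
library(tidyr)

# A small wide-format toy data frame (values invented for illustration)
wide <- data.frame(id = c(1, 2),
                   score_math = c(10, 20),
                   score_stats = c(30, 40))

# Gather the two score columns into long format:
# one row per (id, subject) combination
long <- pivot_longer(wide,
                     cols = starts_with("score_"),
                     names_to = "subject",
                     names_prefix = "score_",
                     values_to = "score")
long
```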

I hope this course will bring me back to the basics, and I feel the active exercise workbook is a good resource for later, when you forget how to do something.

My GitHub repository: https://github.com/nygr/IODS-project

***

# Week 2 Exercise set - Analysis: Regression and model validation

Petra Nygren

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Dec 11 23:53:57 2023"


Read in data needed for exercise

# set working directory and read in previously created csv file
setwd("/Users/nygrpetr/Desktop/iods/iods_petra/")
my_data <- read.csv("data/learning2014.csv") 

The dataset we are using in this exercise consists of learning data from students. It contains each student’s gender (male or female) and age (a numerical value). The variables “deep”, “strategic” and “surface” describe the study approach the student used (deep, strategic or surface learning); the values are means, and observations with the value 0 have been omitted. “points” gives the exam points the person received, and “attitude” is a numerical value representing the person’s attitude towards statistics. The dataset contains variables of integer, numeric and character classes.

# explore dimensions dim() (number of rows and columns) and structure str() of the data
dim(my_data)
## [1] 166   8
str(my_data)
## 'data.frame':    166 obs. of  8 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender   : chr  "F" "M" "F" "M" ...
##  $ age      : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude : int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep     : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surface  : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ strategic: num  3.38 2.75 3.62 3.12 3.62 ...
##  $ points   : int  25 12 24 10 22 21 21 31 24 26 ...
#Remove extra variable "X"
my_data$X <- NULL

Besides the age of the subjects, the other variables appear to have normal distributions. For age, most subjects are between 20 and 25 years, the distribution is right-skewed, and the median age is 22 years. Attitude and points are the most clearly normally distributed and symmetrical, whereas deep and surface show slight skews in their distributions.

Exam points and attitude have the most clear positive correlation - higher attitude points seem to correlate with higher exam points.

# install (if not yet installed) and load library 
# install.packages("ggplot2")
# install.packages("GGally")
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
#Plot some histograms to check distributions of the data
hist(my_data$age)

hist(my_data$attitude)

hist(my_data$deep)

hist(my_data$surface)

hist(my_data$strategic)

hist(my_data$points)

#See summary of the data
summary(my_data)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :14.00   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :32.00   Median :3.667  
##                     Mean   :25.51   Mean   :31.43   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :50.00   Max.   :4.917  
##     surface        strategic         points     
##  Min.   :1.583   Min.   :1.250   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:2.625   1st Qu.:19.00  
##  Median :2.833   Median :3.188   Median :23.00  
##  Mean   :2.787   Mean   :3.121   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:3.625   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :5.000   Max.   :33.00
#Graphical overview
#pairs() creates a matrix of scatterplots between numerical variables. The first variable, gender, is subsetted out because it is non-numerical
pairs(my_data[,-1])

#Alternatively ggpairs can be used to create a plot matrix
ggpairs(my_data, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

From the graphical overview, we can see that attitude, strategic learning and surface learning are the top three variables correlating with exam points. Of these, attitude has the strongest correlation and surface learning the weakest (by visual inspection). Next, we fit a linear model using these three variables as predictors and points as the response variable.

# fit a model with three explanatory variables
model <- lm(points ~ attitude + surface + strategic, data = my_data)

# print out summary
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + surface + strategic, data = my_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## surface     -0.58607    0.80138  -0.731  0.46563    
## strategic    0.85313    0.54159   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# Only attitude is significant, so we fit the model again with attitude alone
model_new <- lm(points ~ attitude, data = my_data)

# print out summary
summary(model_new)
## 
## Call:
## lm(formula = points ~ attitude, data = my_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
Of the input variables used, only attitude towards statistics had a significant effect on the exam points, so we ran a new model with attitude as the only input variable and exam points as the response. When attitude increases by one unit, exam points increase by 0.35 units. R-squared is a measure of how well the model fits the data; it ranges from 0 to 1, with higher values indicating a better fit. With this model, R-squared is 0.1906, so about 19% of the variation in exam points can be explained by our input variable, attitude. Thus, other variables not available in this dataset account for the remaining roughly 81% of the variation. My conclusion from this linear model is that students with more positive attitudes towards statistics achieve higher exam points, but attitude alone is not enough to explain exam success.
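As a side note (my own sanity check, not part of the exercise): for a simple regression with a single predictor, R-squared is just the squared correlation between predictor and response, which is easy to verify on simulated data:

```r
# Self-contained check on simulated data (not the course data):
# in simple linear regression, R-squared equals cor(x, y)^2
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

c(r_squared = summary(fit)$r.squared, cor_squared = cor(x, y)^2)
```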



```r
# Diagnostic plots using the plot() function
par(mfrow = c(2, 2))
plot(model_new)
```

A linear model has several assumptions that must be met for its results to be valid. It assumes that: 1. the relationship between the variables is linear, 2. the observations are independent, 3. the residuals are normally distributed, and 4. homoscedasticity: the residuals have constant variance at every level of the independent variable(s).

Above, we used R’s plot() function to check whether our data meets these assumptions.

By default, the plot() function produces four diagnostic plots. The first is the Residuals vs Fitted plot, which plots the residuals against the fitted values and is used to check for non-linear patterns in the residuals. If there is a clear pattern, it suggests that the linear regression model may not be appropriate for the data. In our plot there is no clear pattern, so I think our data meets the linearity assumption.

The second plot is the QQ-plot, which is used to check for normality. If the residuals are normally distributed, the points in the QQ-plot should fall along a straight line; if they deviate from it, the residuals are not normally distributed. In our data, the points generally follow the straight line, so I would say the residuals follow the normal distribution. There is some deviation toward the bottom and top of the plot, but I would allow this.

The third plot is the Scale-Location plot, which is used to check for homoscedasticity, i.e. that the variance of the residuals is constant across all levels of the predictor variables. It shows the square root of the absolute standardized residuals against the fitted values; if the red trend line is roughly horizontal, the residuals have constant variance. In our plot, I would interpret the red line as mostly horizontal, so we can assume our data is homoscedastic.

The fourth plot is the Residuals vs Leverage plot, which is used to identify influential observations. An observation is influential if it has a large residual and/or a high leverage value; in general, an observation is influential if it has a large effect on the slope of the regression line. In the plot, Cook’s distance is used to flag such observations: an observation with a Cook’s distance greater than 1 is conventionally considered influential. In our data there appear to be some influential outlier observations with a large effect on the line, so to produce a more accurate result, the outliers should be removed and the analysis repeated. However, in the interest of time I have not proceeded with those steps.
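A sketch of how the influential points could be listed explicitly (simulated data here; on the course data the same idea would be cooks.distance(model_new)):

```r
# Simulated example: flagging influential observations with Cook's distance
set.seed(42)
x <- c(rnorm(50), 10)                   # last point: extreme in x (high leverage)
y <- c(0.5 * x[1:50] + rnorm(50), -5)   # ...and far off the trend (large residual)
fit <- lm(y ~ x)

cd <- cooks.distance(fit)
which(cd > 1)   # the conventional cut-off of 1 flags the influential point
```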

Information on the assumptions and plot types was found on 1. statology.org 2. geeksforgeeks.org 3. upgrad.com


# Week 3 exercises

# set working directory and read in previously created txt file
setwd("/Users/nygrpetr/Desktop/iods/iods_petra/")
alc_data <- read.csv("data/alc_data.csv") 
print(colnames(alc_data))
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The dataset we are using includes data on students’ achievement in secondary education at two schools in Portugal. The variables include student grades, demographic features such as mother’s and father’s education, and social and school-related features. The two original datasets concern different subjects: Mathematics (mat) and Portuguese language (por). We have added a column alc_use, which is the average of weekday and weekend alcohol use.
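For reference, a sketch of how the alc_use and high_use columns were derived in the data-wrangling step (column names Dalc/Walc as in the dataset; the cut-off of 2 for high_use is my recollection of the course convention, shown here on invented toy rows):

```r
library(dplyr)

# Toy rows with the same column names as in alc_data (values invented)
toy <- data.frame(Dalc = c(1, 3, 5),    # weekday alcohol use, scale 1-5
                  Walc = c(1, 4, 5))    # weekend alcohol use, scale 1-5

toy <- toy %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # average of weekday and weekend use
         high_use = alc_use > 2)        # cut-off of 2 (assumed course convention)
toy
```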

Here, the purpose of the analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, I chose four interesting variables and, for each of them, present my personal hypothesis about its relationship with alcohol consumption.

I chose to study four variables: sex, Pstatus, Medu (mother’s education) and health. I hypothesize that: 1) male sex will correlate with higher alcohol use, 2) Pstatus: students whose parents live together use less alcohol, 3) students whose mothers are highly educated use less alcohol, and 4) students with better health status use less alcohol.

Next, we will explore the distributions of the previously mentioned variables and their associations with alcohol use.

library(ggplot2)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ggplot(alc_data, aes(sex)) + geom_bar()

ggplot(alc_data, aes(Pstatus)) + geom_bar()

ggplot(alc_data, aes(Medu)) + geom_bar()

ggplot(alc_data, aes(health)) + geom_bar()

In the dataset, there are more female students than male, most students’ parents live together, more students have highly educated mothers than lowly educated ones, and more students report the highest degree of health than the lowest.

#Check what proportions of students in each category (sex: M/F, Pstatus: T/A, Medu: high values to low, health: low values to high) belong to the high alcohol use group
alc_data %>% group_by(sex, high_use) %>% summarise(count = n()) %>% mutate(prop = count / sum(count))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count  prop
##   <chr> <lgl>    <int> <dbl>
## 1 F     FALSE      154 0.790
## 2 F     TRUE        41 0.210
## 3 M     FALSE      105 0.6  
## 4 M     TRUE        70 0.4
alc_data %>% group_by(Pstatus, high_use) %>% summarise(count = n()) %>% mutate(prop = count / sum(count))
## `summarise()` has grouped output by 'Pstatus'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   Pstatus [2]
##   Pstatus high_use count  prop
##   <chr>   <lgl>    <int> <dbl>
## 1 A       FALSE       26 0.684
## 2 A       TRUE        12 0.316
## 3 T       FALSE      233 0.702
## 4 T       TRUE        99 0.298
alc_data %>% group_by(Medu, high_use) %>% summarise(count = n()) %>% mutate(prop = count / sum(count))
## `summarise()` has grouped output by 'Medu'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups:   Medu [5]
##     Medu high_use count  prop
##    <int> <lgl>    <int> <dbl>
##  1     0 FALSE        1 0.333
##  2     0 TRUE         2 0.667
##  3     1 FALSE       31 0.633
##  4     1 TRUE        18 0.367
##  5     2 FALSE       78 0.812
##  6     2 TRUE        18 0.188
##  7     3 FALSE       57 0.613
##  8     3 TRUE        36 0.387
##  9     4 FALSE       92 0.713
## 10     4 TRUE        37 0.287
alc_data %>% group_by(health, high_use) %>% summarise(count = n()) %>% mutate(prop = count / sum(count))
## `summarise()` has grouped output by 'health'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups:   health [5]
##    health high_use count  prop
##     <int> <lgl>    <int> <dbl>
##  1      1 FALSE       35 0.761
##  2      1 TRUE        11 0.239
##  3      2 FALSE       28 0.667
##  4      2 TRUE        14 0.333
##  5      3 FALSE       61 0.762
##  6      3 TRUE        19 0.238
##  7      4 FALSE       45 0.726
##  8      4 TRUE        17 0.274
##  9      5 FALSE       90 0.643
## 10      5 TRUE        50 0.357
ggplot(data = alc_data, aes(x = sex, fill = high_use)) +
  geom_bar(position = "dodge")

ggplot(data = alc_data, aes(x = Pstatus, fill = high_use)) +
  geom_bar(position = "dodge")

ggplot(data = alc_data, aes(x = Medu, fill = high_use)) +
  geom_bar(position = "dodge")

ggplot(data = alc_data, aes(x = health, fill = high_use)) +
  geom_bar(position = "dodge")

My prediction about sex and alcohol use seems to have some truth behind it. With Pstatus, the proportions of high and low alcohol use seem roughly equal between children of parents who cohabit and those who do not. Contrary to my predictions, based on visual inspection of the data, students with more highly educated mothers and students with better health both seem to drink more alcohol, so I doubt my hypotheses concerning those variables.

Next, we will run a logistic regression model to test the relationships between the variables and alcohol use. Categorical variables with more than two levels are first converted to factors.

#Turn Medu and health into factors to enable them to be used in the model
alc_data$Medu <- factor(alc_data$Medu)
alc_data$health<- factor(alc_data$health)

#Run logistic regression model
my.mod <- glm(high_use ~ sex + Pstatus + Medu + health, data = alc_data, family = "binomial")

summary(my.mod)
## 
## Call:
## glm(formula = high_use ~ sex + Pstatus + Medu + health, family = "binomial", 
##     data = alc_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.03050    1.35266   0.023  0.98201    
## sexM         0.92338    0.24456   3.776  0.00016 ***
## PstatusT    -0.09829    0.39260  -0.250  0.80232    
## Medu1       -1.12776    1.29505  -0.871  0.38385    
## Medu2       -2.16404    1.28249  -1.687  0.09153 .  
## Medu3       -1.16803    1.27388  -0.917  0.35919    
## Medu4       -1.69835    1.27132  -1.336  0.18158    
## health2      0.55552    0.50095   1.109  0.26745    
## health3      0.02365    0.45133   0.052  0.95821    
## health4      0.28251    0.46379   0.609  0.54243    
## health5      0.44817    0.40569   1.105  0.26929    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 420.13  on 359  degrees of freedom
## AIC: 442.13
## 
## Number of Fisher Scoring iterations: 4
#Compute odds ratios and confidence intervals, print the results
OR <- coef(my.mod) %>% exp
CI <- confint(my.mod) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR       2.5 %    97.5 %
## (Intercept) 1.0309650 0.076881229 25.754772
## sexM        2.5177759 1.566506913  4.093497
## PstatusT    0.9063878 0.427092627  2.013112
## Medu1       0.3237572 0.013860043  3.858076
## Medu2       0.1148602 0.004990072  1.336025
## Medu3       0.3109786 0.013654668  3.562161
## Medu4       0.1829843 0.008056463  2.084139
## health2     1.7428548 0.655437479  4.730204
## health3     1.0239323 0.426979687  2.535839
## health4     1.3264582 0.538881788  3.360685
## health5     1.5654372 0.722581102  3.586186

Of our four chosen variables, only sex had a statistically significant effect on alcohol use: male sex was associated with an increase of 0.92 in the log-odds of the high_use response variable being TRUE. The odds ratio is a “relative measure of effect”, which allows comparing the test group of a study relative to the control; an odds ratio of 1 means no effect on the output variable. The odds ratio for sex is the only one whose 95% confidence interval does not include 1, being 2.52 with a confidence interval of 1.57 to 4.09.
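The odds ratio is simply the exponentiated regression coefficient; a small self-contained illustration using the sexM estimate and standard error from the summary above:

```r
# Odds ratios are exponentiated logistic-regression coefficients.
# Numbers below are the sexM estimate and SE from the model summary above.
est <- 0.92338
se  <- 0.24456

exp(est)                          # odds ratio for sexM, ~2.52
exp(est + c(-1, 1) * 1.96 * se)   # Wald-type 95% CI, close to the profile CI above
```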

Next, we use only the variable “sex”, which was the only significant one, to fit a new model, use it to predict probabilities of high use, and then compare the predictions to the actual values to investigate the predictive power of the model.

new.mod <- glm(high_use ~ sex, data = alc_data, family = "binomial")
probabilities <- predict(new.mod, type = "response")

#Add a column for probabilities, use it to make predictions of high use, and show the last 10 values
alc_data <- mutate(alc_data, probability = probabilities)
alc_data <- mutate(alc_data, prediction = probability > 0.5)
select(alc_data, sex, high_use, probability, prediction) %>% tail(10)
##     sex high_use probability prediction
## 361   M    FALSE   0.4000000      FALSE
## 362   M    FALSE   0.4000000      FALSE
## 363   M     TRUE   0.4000000      FALSE
## 364   F    FALSE   0.2102564      FALSE
## 365   F    FALSE   0.2102564      FALSE
## 366   F    FALSE   0.2102564      FALSE
## 367   F    FALSE   0.2102564      FALSE
## 368   F    FALSE   0.2102564      FALSE
## 369   M     TRUE   0.4000000      FALSE
## 370   M     TRUE   0.4000000      FALSE
#Make a table
table(high_use = alc_data$high_use, prediction = alc_data$prediction) 
##         prediction
## high_use FALSE
##    FALSE   259
##    TRUE    111
prop.table(table(high_use = alc_data$high_use, prediction = alc_data$prediction)) * 100
##         prediction
## high_use FALSE
##    FALSE    70
##    TRUE     30
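As a quick check of my own, the overall training error can be computed from the counts in the cross-tabulation above:

```r
# Counts from the cross-tabulation above: the model predicts FALSE for everyone,
# so the 111 actually TRUE cases are the only wrong predictions
wrong <- 111
total <- 259 + 111
wrong / total   # training error: 0.3
```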

I would say the model is not great for this data. Since the predicted probability is below 0.5 for both sexes, the model predicts FALSE for every observation: the 259 actual FALSE high-use values are predicted correctly, but the model mistakenly predicts FALSE for all 111 high-use values that were actually TRUE. I would therefore not use this model, as its predictive power is subpar.

***

# Week 4 exercises


Here, we use the package “MASS” to load the dataset “Boston”, which contains information about housing values in suburbs of Boston. The data frame has 506 rows and 14 columns and contains numeric and integer variables.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
pairs(Boston)

Here we show a graphical overview of the data and summaries of the variables. We want the data in long format, for which we use melt(); then we examine the distributions of the variables and the correlations between them.

We see that most of the variables are not normally distributed; rm seems to be the most normally distributed variable. Some variables have two-peaked distributions (e.g. indus), some are right-skewed (e.g. dis) and others left-skewed (e.g. ptratio).

From the correlation plot we can see that lstat and medv, age and dis, nox and dis, and indus and dis are all strongly negatively correlated, while rad and tax are strongly positively correlated. The colour scale takes some getting used to: according to the legend, red denotes -1 (negative) and blue denotes 1 (positive).

library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(dplyr)
require(reshape2)
## Loading required package: reshape2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
data_boston_long <- melt(Boston)
## No id variables; using all as measure variables
ggplot(data = data_boston_long, aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")

mat_Boston <- cor(Boston) %>% round(2)
corrplot(mat_Boston, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

We then scale the data; observing the summary of the scaled data, we see that all means are now 0, whereas before scaling they ranged from about 0.07 (chas) to 408 (tax). Then we create a categorical crime variable and add it to the data. Finally, we divide the data into train and test sets.

# center and standardize variables
scaled_data <- scale(Boston)

# print summaries of the scaled variables
summary(scaled_data) # note that all means are now 0, as expected after scaling
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# Object is matrix so change to df
scaled_dat_df <- as.data.frame(scaled_data)

# Create categorical "crime" variable and add it to the dataset
bins <- quantile(scaled_dat_df$crim)
crime <- cut(scaled_dat_df$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
scaled_dat_df <- dplyr::select(scaled_dat_df, -crim)
scaled_dat_df <- data.frame(scaled_dat_df, crime)

#Create train and test sets of the data such that 80% of the data is in the train set
n <- nrow(scaled_dat_df)
ind <- sample(n,  size = n * 0.8)
train <- scaled_dat_df[ind,]
test <- scaled_dat_df[-ind,]

Next, we fit a linear discriminant analysis (LDA) on the train set. We use our created categorical crime rate as the target variable and all the other variables in the dataset as predictors, and then we draw a bi-plot.

#Fit lda on train set
lda.fit <- lda(crime ~ . , data = train)

# Draw bi-plot
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

We use the test data to predict crime rates and compare the predictions to the true rates. In the table we see that the analysis performs well for the high crime rate category, where all observations are predicted correctly, but it performs more poorly in the other categories, particularly med_low and med_high.

#Save crime categories from the test set and then drop the variable from the test data frame
test_classes <- test$crime
test <- dplyr::select(test, -crime)

#Predict classes with test data and cross tabulate to compare with the real classes
lda.prediction <- predict(lda.fit, newdata = test)
table(correct = test_classes, predicted = lda.prediction$class)
##           predicted
## correct    low med_low med_high high
##   low       10      13        2    0
##   med_low    4      19        4    0
##   med_high   0       8       16    2
##   high       0       0        0   24
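To quantify this, the overall accuracy can be computed from the confusion matrix above (my own addition; since the train/test split is random, the exact numbers change between knits):

```r
# Overall accuracy from the confusion matrix above
# (correct predictions are on the diagonal)
categories <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(10, 13,  2,  0,
                  4, 19,  4,  0,
                  0,  8, 16,  2,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = categories, predicted = categories))

sum(diag(conf)) / sum(conf)   # ~0.68 for this particular split
```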

We reload the data, calculate Euclidean distances and run k-means. I would say the optimal number of clusters is 6, as this is where the curve flattens out.

# Load and scale data, calculate Euclidean distances
data("Boston")
scaled_data <- scale(Boston)
euclidian_dist <- dist(scaled_data)

# Determine the number of clusters and run k means
set.seed(246)
k_max <- 20
twcss <- sapply(1:k_max, function(k){kmeans(scaled_data, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

km <- kmeans(scaled_data, centers = 6)

# Visualize results
pairs(scaled_data, col = km$cluster)


# Week 5 Exercises

First we read in the data and explore its structure and dimensions, then we set the country names as rownames. After that we show a graphical overview and observe the distributions of and correlations between the variables.

#Read in data, explore, and make country col -> rownames
human <- read.csv("human_filtered.csv")
dim(human)
## [1] 155  10
#There is an extra column for row number "X" so lets remove that
human$X <- NULL
str(human)
## 'data.frame':    155 obs. of  9 variables:
##  $ country           : chr  "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ f_m_edu           : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ f_m_labour        : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ expected_edu      : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ life_exp          : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ gni               : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ maternal_mortality: int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adolesc_birth     : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ repr_parl         : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
rownames(human) <- human$country
human$country <- NULL

#Graphical overview
library(GGally) 
ggpairs(human)

summary(human)
##     f_m_edu         f_m_labour      expected_edu      life_exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni         maternal_mortality adolesc_birth      repr_parl    
##  Min.   :   581   Min.   :   1.0     Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5     1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0     Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1     Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0     3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0     Max.   :204.80   Max.   :57.50

Observing the ggpairs output, expected years of education (expected_edu) is the only variable that follows a roughly normal distribution; the others are skewed. Life expectancy and expected years of education are strongly positively correlated; other positively correlated pairs are expected_edu with gni, and maternal mortality ratio with adolescent birth rate.

Next we perform a principal component analysis (PCA), a form of dimensionality reduction that tells us how much of the variability in the data is explained by each principal component, and draw a biplot to visualize the results.
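The mechanics can be sketched on simulated data (not the human data): `summary(prcomp(...))$importance` stores, per component, the standard deviation, the proportion of variance explained, and the cumulative proportion.

```r
# Minimal PCA sketch: with one nearly duplicated variable, the first
# component should absorb most of the variance.
set.seed(42)
toy <- data.frame(a = rnorm(100), b = rnorm(100))
toy$c <- toy$a + rnorm(100, sd = 0.1)         # c is almost a copy of a
p <- prcomp(toy, scale. = TRUE)
round(summary(p)$importance[2, ], 3)           # proportion of variance per PC
```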

# PCA and summary
pca_human <- prcomp(human)
s <- summary(pca_human)
pca_pr <- round(s$importance[2, ], digits = 5)
print(pca_pr)
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 0.9999 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
pc_lab <- paste0(names(pca_pr), " (", 100 * pca_pr, "%)")

# Biplot
biplot(pca_human, cex = c(1, 1), col = c("grey40", "blue"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

The PCA on the raw data shows that the first principal component explains practically all (99.99%) of the variability, and it is driven almost entirely by a single variable, gni.
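This behaviour is easy to reproduce with simulated data: PCA operates on variances, so without scaling, whichever column has the largest variance (here gni, measured in dollars) dominates the first component.

```r
# Simulated sketch (not the human data): one column with a much larger
# variance dominates PC1 unless the data is scaled first.
set.seed(7)
d <- data.frame(small = rnorm(100),               # variance around 1
                huge  = rnorm(100, sd = 1000))    # variance around 1e6
raw_pc1    <- summary(prcomp(d))$importance[2, 1]         # PC1 share, raw
scaled_pc1 <- summary(prcomp(scale(d)))$importance[2, 1]  # PC1 share, scaled
c(raw = raw_pc1, scaled = scaled_pc1)  # raw close to 1, scaled close to 0.5
```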

Next we standardize the data and repeat the analysis.

# Scale data
human_scaled <- scale(human)

# Repeat PCA with scaled data
pca_human_s <- prcomp(human_scaled)
s_scaled <- summary(pca_human_s)
pca_pr_scaled <- round(s_scaled$importance[2, ], digits = 5)
print(pca_pr_scaled)
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.53605 0.16237 0.09571 0.07583 0.05477 0.03595 0.02634 0.01298
pc_lab_scaled <- paste0(names(pca_pr_scaled), " (", 100 * pca_pr_scaled, "%)")

# Biplot
biplot(pca_human_s, cex = c(0.8, 1), col = c("grey40", "blue"), xlab = pc_lab_scaled[1], ylab = pc_lab_scaled[2])

With the scaled data, PC1 explains 53.6% of the variability and PC2 explains 16.2%. This differs from the raw-data results, where PC1 appeared to explain practically all of the variability (99.99%) because it was dominated by gni (gross national income). After standardization, the female-to-male labour force ratio and the percentage of representation in parliament load mainly on PC2 (the y axis), while the remaining variables explain the variance along PC1 (the x axis).

Now we continue with the tea dataset from the FactoMineR package; we explore the data and then perform a multiple correspondence analysis (MCA).

#Load data, explore structure and dimensions
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
#Visualize the data
library(ggplot2)
library(dplyr)
library(tidyr)
#Select a subset of columns, since 36 variables are hard to visualize at once
tea_cols <- dplyr::select(tea, Tea, How, sugar, where, healthy)
pivot_longer(tea_cols, cols = everything()) %>% 
  ggplot(aes(value)) + geom_bar() + facet_wrap("name", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Now we perform Multiple Correspondence Analysis (MCA) on the tea data.

# MCA
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
mca_tea <- MCA(tea_cols, graph = FALSE)
summary(mca_tea)
## 
## Call:
## MCA(X = tea_cols, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.276   0.264   0.240   0.203   0.202   0.181   0.167
## % of var.             15.336  14.659  13.327  11.265  11.198  10.080   9.279
## Cumulative % of var.  15.336  29.995  43.322  54.587  65.785  75.865  85.144
##                        Dim.8   Dim.9
## Variance               0.144   0.124
## % of var.              7.980   6.876
## Cumulative % of var.  93.124 100.000
## 
## Individuals (the 10 first)
##                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1           |  0.126  0.019  0.014 | -0.175  0.039  0.027 | -0.296  0.122
## 2           |  0.622  0.467  0.221 | -0.280  0.099  0.045 | -0.461  0.295
## 3           | -0.064  0.005  0.007 | -0.286  0.104  0.136 | -0.261  0.095
## 4           | -0.411  0.204  0.267 |  0.105  0.014  0.017 | -0.205  0.058
## 5           | -0.417  0.210  0.177 | -0.168  0.036  0.029 | -0.361  0.181
## 6           | -0.064  0.005  0.007 | -0.286  0.104  0.136 | -0.261  0.095
## 7           | -0.064  0.005  0.007 | -0.286  0.104  0.136 | -0.261  0.095
## 8           |  0.622  0.467  0.221 | -0.280  0.099  0.045 | -0.461  0.295
## 9           |  0.150  0.027  0.011 |  0.536  0.362  0.137 | -0.195  0.053
## 10          |  0.889  0.955  0.507 | -0.150  0.028  0.014 | -0.078  0.008
##               cos2  
## 1            0.078 |
## 2            0.121 |
## 3            0.113 |
## 4            0.067 |
## 5            0.132 |
## 6            0.113 |
## 7            0.113 |
## 8            0.121 |
## 9            0.018 |
## 10           0.004 |
## 
## Categories (the 10 first)
##                 Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test
## black       |   1.053  19.828   0.363  10.422 |  -0.309   1.788   0.031  -3.060
## Earl Grey   |  -0.355   5.888   0.228  -8.254 |   0.410   8.195   0.303   9.521
## green       |  -0.283   0.640   0.010  -1.723 |  -1.704  24.216   0.359 -10.360
## alone       |  -0.229   2.473   0.098  -5.400 |  -0.362   6.441   0.243  -8.520
## lemon       |   0.088   0.062   0.001   0.535 |   1.161  11.230   0.166   7.055
## milk        |   0.165   0.415   0.007   1.472 |   0.374   2.226   0.037   3.334
## other       |   3.486  26.416   0.376  10.601 |   0.961   2.099   0.029   2.922
## No.sugar    |   0.440   7.237   0.207   7.861 |  -0.486   9.239   0.252  -8.684
## sugar       |  -0.470   7.736   0.207  -7.861 |   0.519   9.876   0.252   8.684
## chain store |  -0.302   4.242   0.163  -6.973 |  -0.207   2.079   0.076  -4.773
##                 Dim.3     ctr    cos2  v.test  
## black       |  -0.277   1.573   0.025  -2.736 |
## Earl Grey   |  -0.053   0.151   0.005  -1.231 |
## green       |   0.930   7.937   0.107   5.655 |
## alone       |  -0.162   1.414   0.048  -3.806 |
## lemon       |   1.931  34.198   0.461  11.739 |
## milk        |  -0.427   3.198   0.049  -3.810 |
## other       |  -0.589   0.868   0.011  -1.791 |
## No.sugar    |  -0.066   0.189   0.005  -1.183 |
## sugar       |   0.071   0.202   0.005   1.183 |
## chain store |  -0.432   9.939   0.331  -9.950 |
## 
## Categorical variables (eta2)
##               Dim.1 Dim.2 Dim.3  
## Tea         | 0.364 0.451 0.116 |
## How         | 0.405 0.290 0.476 |
## sugar       | 0.207 0.252 0.005 |
## where       | 0.224 0.306 0.590 |
## healthy     | 0.181 0.020 0.012 |
# Visualize
plot(mca_tea, invisible=c("ind"), graph.type = "classic", habillage = "quali")

fviz_mca_biplot(mca_tea, 
               repel = TRUE, # Avoid text overlapping (slow if many point)
               ggtheme = theme_minimal())

Looking at the results, the categories with positive coordinates on the dimensions, such as black, No.sugar, other, lemon and milk, are the most closely linked, while categories with negative coordinates, such as green, tea shop and chain store, are the least linked. Dimension 1 explains 15.34% of the variance and dimension 2 explains 14.66%. No clear clusters appear on the factor map.

# Week 6 Exercises

Longitudinal data

Part 1: RATS

The RATS dataset comes from a study on three groups of rats that were fed different diets; body weight was measured repeatedly to track growth. The objective was to find out whether the different diets had an effect on rat growth.

#Read in the data, convert ID and Group to factors, and reshape to long format
library(dplyr)
library(ggplot2)
library(tidyr)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

RATS_long <- pivot_longer(RATS, cols = -c(ID, Group), 
                      names_to = "groups",
                      values_to = "wd") %>% 
  mutate(Time = as.integer(substr(groups, 3, 4))) %>%
  arrange(Time)

Next we visualize the data. All rats show a tendency to grow during the experiment. The initial body weights seem to be highest in group 3 and lowest in group 1. Based on this plot there appears to be an outlier in group 2, but let's check that later.

ggplot(RATS_long, aes(x = Time, y = wd, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATS_long$wd), max(RATS_long$wd)))

There appears to be a tracking phenomenon: rats with higher weight at the start tend to gain more weight and thus remain heavier throughout the study. Ideally the rats would have been split into groups with similar baseline weight distributions, rather than having large differences in starting weights. Could there be some biological reason, perhaps genetic, why some rats are heavier to start with and thus more likely to gain weight?

Now we standardize the data in order to see the tracking phenomenon more clearly.

library(dplyr)
library(tidyr)

#Now we standardize and repeat the plot
RATS_long <- RATS_long %>%
  group_by(Time) %>%
  mutate(stdwd = scale(wd)) %>%
  ungroup()

ggplot(RATS_long, aes(x = Time, y = stdwd, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized wd")

Let's now plot the means and standard deviations for each treatment group at each time point. The line plot is difficult to interpret because the baseline weights differ so much, but on average, group 2 rats seem to have gained the most weight over the course of the diet. However, group 2 also has the largest standard deviation, possibly due to the suspected outlier (rat number 12); since there are so few rats in that group, let's leave it for now.

n <- table(RATS_long$Group)

RATSS <- RATS_long %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(wd), se = sd(wd)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, colour = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.4)) +
  scale_y_continuous(name = "mean rat weight +/- sd(rat weight)")

Now we calculate a summary measure (the mean weight per rat, excluding the baseline observation from time point 1) and draw boxplots of it. The boxplots show that, in addition to the outlier in group 2, there are also outliers in groups 1 and 3. Since the group 2 outlier is the most obvious one, and also the easiest to remove, let's remove it.

RATS_filt <- RATS_long %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(wd)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATS_filt, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
  scale_y_continuous(name = "mean(wd), weeks 1-9")

RATS_filt <- RATS_filt %>% filter(mean < 580)

Based on the visualisation there seems to be a difference between the diets offered to the rats, but we need to confirm this with a statistical test. Based on the t-tests, all pairwise group comparisons show significant differences. However, I would not put much trust in this result: there were very few samples per group, there were outliers, and the baseline characteristics of the rats differed, which may reflect underlying differences in biology, genetics or metabolism.

t.test(mean ~ Group, data = filter(RATS_filt,Group!=3), var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -25.399, df = 9, p-value = 1.094e-09
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -204.0633 -170.6867
## sample estimates:
## mean in group 1 mean in group 2 
##         265.025         452.400
t.test(mean ~ Group, data = filter(RATS_filt,Group!=2), var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -27.824, df = 10, p-value = 8.345e-11
## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -283.4943 -241.4557
## sample estimates:
## mean in group 1 mean in group 3 
##         265.025         527.500
t.test(mean ~ Group, data = filter(RATS_filt,Group!=1), var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -5.632, df = 5, p-value = 0.002446
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -109.37769  -40.82231
## sample estimates:
## mean in group 2 mean in group 3 
##           452.4           527.5
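With three groups, running three separate t-tests inflates the family-wise error rate; `pairwise.t.test()` performs all pairwise comparisons with a multiple-testing correction (Holm by default). A sketch on simulated data with the same shape as `RATS_filt` (group sizes and means are illustrative, not the real data):

```r
# Simulated stand-in for RATS_filt: three diet groups with different means.
set.seed(3)
toy <- data.frame(Group = factor(rep(1:3, each = 8)),
                  mean  = c(rnorm(8, 265, 10),
                            rnorm(8, 452, 10),
                            rnorm(8, 527, 10)))
# All pairwise t-tests with Holm-adjusted p-values in one call
pairwise.t.test(toy$mean, toy$Group, p.adjust.method = "holm")
```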

Part 2: BPRS

This dataset contains data on 40 subjects and their scores on the brief psychiatric rating scale (BPRS, used to evaluate schizophrenia) before and during treatment, with the subjects split into two treatment groups. The scale rates different symptoms, such as hallucinations, from 1 (low) to 7 (high).

First we read in the data and convert treatment and subject to factors

BPRS_long <- read.csv("/Users/nygrpetr/Desktop/iods/iods_petra/data/bprs_long.csv")

BPRS_long$treatment <- factor(BPRS_long$treatment)
BPRS_long$subject <- factor(BPRS_long$subject)

glimpse(BPRS_long)
## Rows: 360
## Columns: 6
## $ X         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Next we visualize the data. There may be an outlier in treatment group 2, but overall the two groups appear quite similar.

ggplot(BPRS_long, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRS_long$bprs), max(BPRS_long$bprs)))

Next, we fit a linear regression model, ignoring the fact that the observations come from the same subjects and treating them as independent. With this model, treatment does not appear to have an effect, but time does have a significant effect.

fit <- lm(bprs ~ treatment + week, data = BPRS_long)
summary(fit)
## 
## Call:
## lm(formula = bprs ~ treatment + week, data = BPRS_long)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Next, because the observations are not actually independent, we fit a random intercept model with subject as the random effect

#install.packages("lme4")
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
BPRS_random <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS_long, REML = FALSE)
summary(BPRS_random)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS_long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Next we try a random intercept and random slope model on the data

# create a random intercept and random slope model
BPRS_random1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS_long, REML = FALSE)
summary(BPRS_random1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS_long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

Then we use anova to test which model fits the data better.

anova(BPRS_random1, BPRS_random)
## Data: BPRS_long
## Models:
## BPRS_random: bprs ~ week + treatment + (1 | subject)
## BPRS_random1: bprs ~ week + treatment + (week | subject)
##              npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_random     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_random1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The comparison yields a significant p-value, so the random intercept and random slope model fits the data better than the random intercept model.
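The chi-squared statistic reported by `anova()` is simply the difference in deviances between the two models, with degrees of freedom equal to the difference in parameter counts (7 - 5 = 2). It can be reproduced by hand from the rounded values in the table above (so the result matches only up to rounding):

```r
# Likelihood ratio test by hand, using the deviances from the anova table.
lrt <- 2738.7 - 2731.4                           # deviance difference
p   <- pchisq(lrt, df = 2, lower.tail = FALSE)   # df = 7 - 5
round(c(statistic = lrt, p.value = p), 4)        # close to 7.2721 and 0.0264
```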

Next we fit a random intercept and random slope model with an interaction between time and treatment, and again use anova to test which model fits better. Adding the interaction does not improve the fit significantly.

BPRS_random2 <- lmer(bprs ~ week + treatment + (week | subject) + week * treatment, data = BPRS_long, REML = FALSE)
summary(BPRS_random2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + week * treatment
##    Data: BPRS_long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(BPRS_random2, BPRS_random1)
## Data: BPRS_long
## Models:
## BPRS_random1: bprs ~ week + treatment + (week | subject)
## BPRS_random2: bprs ~ week + treatment + (week | subject) + week * treatment
##              npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_random1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_random2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now let's plot the observed and fitted values. The fitted values show that the BPRS scores of both groups decrease with time, so both treatments appear to reduce schizophrenic symptoms.

BPRS_long$fit <- fitted(BPRS_random2)  

ggplot(BPRS_long, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(color = treatment)) +
  scale_x_continuous(name = "Time", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "BPRS") +
  theme(legend.position = "top")

ggplot(BPRS_long, aes(x = week, y = fit, group = subject)) +
  geom_line(aes(color = treatment)) +
  scale_x_continuous(name = "Time", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Fitted BPRS") +
  theme(legend.position = "top")